Abstract

The competency testing of teacher candidates has become almost universal in the United
States. The practice has its origin in concern that some of those who choose to pursue teaching
careers in the elementary and secondary schools may lack the competencies that effective teaching
requires. Stated in terms of decision errors, the fear was that some of those moving into full-time
teaching may represent a population of false positives, teachers who lack necessary skills and
abilities but are certified nevertheless. The contingencies under which the competency test is
actually administered in California, however, suggest greater worry over the possibility of false
negatives, a concern that some of those who are denied certification may actually be qualified.

The ubiquitous element in college and university admissions procedures, and in
certification activities, is some sort of testing. Although employing test scores as a selection
criterion is often criticized, and major educational research organizations uniformly condemn
using a test score as a sole selection criterion, test scores remain a fixture in many selection
decisions. Many institutions, particularly colleges and universities, face the need to classify and
screen large numbers of applicants, and to do so with accuracy and efficiency.

The value of test data, of course, is in what they reveal about the level of a candidate's
aptitude, ability, or command of some essential skill or knowledge. Because mental traits,
abilities, and knowledge cannot be measured directly, quantifying them involves some risk of
measurement error. Errors in mental measurement increase the potential for errors in the
decisions that are made regarding who has the requisite level of the measured characteristic,
who will most likely succeed in advanced study, or who will become the most successful
candidates. The point of this paper is to examine some of the issues associated with decision
error, particularly as they relate to teacher certification in California, although they generalize to
many other selection and screening situations.

Error in Assessment 

The point of collecting candidate data is to identify what is relevant to decisions about
candidates’ qualifications. This requires the predetermination of some sort of reference to which
assessment data can be compared in order to reach a decision. When what the candidate must
know or be able to do can be specified in discrete terms, minimal competencies are defined. They
may take the form of formal standards or some less formal set of criteria. The processes for
determining them are the domain of standard setting.1

Particularly when it is difficult to determine the minimal level of some required ability 
(What level of reasoning is sufficient?), or when there are greater numbers of qualified applicants 
than there is opportunity to accommodate them, a normative reference is usually adopted.
Normative references specify, as the required standard, some ranking within the group of all who
fit a defined category; ‘the top 20% of high school seniors’ and ‘applicants scoring in the upper
quartile’ are examples.

Whether the standard is normative or criterion-based, data are gathered so that an 
informed decision can be made regarding which candidates qualify. Both approaches impose the 
risk of error, but the risk may be greater when objective criteria are used since the difficulty of 
accurate assessment of the candidate’s ability can be compounded by any error that might occur 
in identifying the appropriate standard. In spite of the increasing popularity of assessment 
grounded in authentic outcomes, assessment procedures can rarely be truly comprehensive. 
Inferences must be drawn from a sample of candidates’ responses regarding the larger universe of 
their skills, abilities, and knowledge. Furthermore, there may be some component of what is 
measured that has its origin in a characteristic other than the characteristic of interest. For
example, students’ command of the language may be confused with their analytical ability, or
their talent for calculation may become a component of what is inadvertently termed
problem-solving ability.
Such occurrences increase the potential for decision errors when candidates are classified. 

Components of a score that are irrelevant to the named construct reflect measurement
error. Because problem-solving, for example, cannot be assessed the way one assesses height or
age, there is a great potential for measurement error. Lucky guesses on a multiple choice exam, a
grader's scoring mistakes, a level of test-wiseness that allows test-takers to detect correct
responses when they lack the construct which is the focus of the exercise, even the candidate's
health can contribute to measurement error. These situations compound the effects of errors in
measurement that might occur because the test items are of poor quality or insufficient quantity,
because test-takers don’t understand the directions, or because the data are tabulated improperly.
Any of these factors may inflate or deflate a score, and in doing so they may give rise to decision
errors.

Decision Errors

In an ideal assessment world where there is no error, the population of competent 
candidates would be entirely separate from those who are incompetent (Figure 1a). This is
actually the inference when a cut-off score separates those who are competent from those who 
are not. But because of measurement error, the scoring distributions of the competent and the 
incompetent can overlap, giving rise to decision errors. 

Decision errors fall into two categories. False positive errors, also called alpha (α) errors
or type I errors, occur when one who lacks the relevant characteristic is judged to possess it.
Some of those judged to be competent are actually not competent, often those at the lower end
of their distribution (Figure 1b). False negative errors, also called beta (β) or type II errors, occur
when candidates who possess a necessary characteristic at the level required are judged instead to
be wanting. Some of those judged to be not competent are actually competent, typically those
at the upper end of that distribution.



Place Figures 1a and 1b About Here



While error inflates and deflates the scores of members of either group, only those scores 
nearest the cut-off score are of concern, since they represent the candidates who are most likely 
to be misjudged. In particular, errors that deflate the scores of those candidates in the lowest 
ranks of the competent and the errors that inflate the scores of those in the highest ranks of the 
incompetent provide the greatest probability of misclassification. 



The Horns of a Dilemma 

Part of the difficulty in making candidate decisions is that false positive and false
negative classifications are inherently related. The number of false negatives can be minimized
by lowering the cut-off score, but the consequence will be more false positive classifications.
Correspondingly, false positives can be minimized by raising the cut-off score, but only by
increasing the likelihood of false negative decisions. Therefore, a critical element in
standard-setting procedures is deciding whether false positive or false negative errors pose the
greater threat.
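
The trade-off can be illustrated numerically. The sketch below is not drawn from any data in
this paper: it assumes, purely for illustration, that the observed scores of the not competent
and the competent are normally distributed with means of 40 and 60 and a common standard
deviation of 10. The function name and all parameter values are hypothetical. The rates it
reports are conditional on group membership (for example, the share of not competent
candidates who score at or above the cut-off).

```python
from statistics import NormalDist

# Hypothetical observed-score distributions (assumptions, not from the paper).
NOT_COMPETENT = NormalDist(mu=40, sigma=10)
COMPETENT = NormalDist(mu=60, sigma=10)


def error_rates(cutoff):
    """Return (false_positive_rate, false_negative_rate) for a cut-off score."""
    # False positive: a not competent candidate scores at or above the cut-off.
    false_positive = 1 - NOT_COMPETENT.cdf(cutoff)
    # False negative: a competent candidate scores below the cut-off.
    false_negative = COMPETENT.cdf(cutoff)
    return false_positive, false_negative


# Moving the cut-off up lowers one rate only by raising the other.
for cutoff in (45, 50, 55):
    fp, fn = error_rates(cutoff)
    print(f"cut-off {cutoff}: false positive {fp:.3f}, false negative {fn:.3f}")
```

Under these assumed distributions, no cut-off reduces both rates at once; the choice of
cut-off only determines how the error is divided between the two types.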

Choosing One’s Error 

Which of type I or type II error is more acceptable depends upon circumstances and is 
manifest in the rigor of the standard. Those judging the competency of prospective surgeons are 
more willing to accommodate type II than type I errors. The safety of prospective patients
requires that standards be fixed relatively high. On the other hand, when landed immigrants seek 
naturalization as United States citizens they are given an oral examination of their understanding 
of U.S. government. Decision-makers may determine that the occasional false positive 
classification (the individual really does not quite understand the system) is preferable to denying 
someone citizenship who has a reasonable understanding of the political system, but does not 
speak English well enough to respond to all of the nuances of the questions. 

In many other instances, the issues are less clear-cut and questions about which type of 
error is more palatable are grounded in politics as much as in measurement issues. But the 
preference for one type of error over another is implicit in the rigor of the standard. Teacher 
certification in California represents a case-in-point. Twenty years ago, much of the impetus for 
educational reform came from criticisms that teacher candidates are among the least 
academically talented of university students (Nelli, 1984; Weaver, 1981). In the wake of the 
charges, nearly every state adopted competency tests with particular attention to candidates’
academic, rather than pedagogical, ability.

As long as concern about teacher competency continues to be a driving force behind the
use of screening instruments (Barth, 2000; Hextall, Mahoney, and Menter, 2001; Song and
Christiansen, 2001), the suggestion is that false negative errors are less objectionable. For the
sake of the learner, it would seem better to occasionally exclude someone who may actually
possess the required literacy skills than to provide a credential to someone who does not.

Besides being less objectionable from the standpoint of educational reform objectives,
false negative decisions carry a built-in compensation that allows them to be corrected:
the test-taker may repeat the test. Besides allowing a decision error to be rectified, retesting
gives a second chance to candidates who may have worked in the interim to
develop the measured competencies. On retesting, candidates may be able to meet standards they
failed to meet in an initial trial.

Although allowing repeated attempts to pass a competency test can correct for false
negatives, what adjustment is made for the continuing possibility of a false positive? One option
is to adjust the cut-off score with each repetition of the test. Millman (1989) explained that when
the required score remains fixed, the risk of a false positive decision accumulates over successive
administrations of the test to the point that, as repetitions increase, there is a very
high risk of false positive classifications. In a practical demonstration of this potential, Huynh
(1990) used high school exit exam data to show that, with a fixed cut-off score, repeated
attempts drive the probability of a false negative toward 0 and the probability of a false
positive toward 1.
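
The accumulation of risk can be illustrated with a minimal sketch. It assumes, as a
simplification that is not taken from Millman (1989) or Huynh (1990), that attempts are
independent and that a candidate's chance of passing is the same on every attempt. The
function name and the 20% figure are hypothetical.

```python
def pass_within(attempts, p_single):
    """Probability of passing at least once in `attempts` independent tries.

    Assumes each attempt has the same pass probability `p_single`
    (a simplification; real retakes are neither independent nor identical).
    """
    return 1 - (1 - p_single) ** attempts


# Hypothetical candidate who lacks the skill but has a 20% chance of
# scoring at or above a fixed cut-off on any one attempt.
for n in (1, 3, 5, 10):
    print(f"{n:2d} attempts: P(false positive) = {pass_within(n, 0.20):.3f}")
```

Under the same assumptions, a competent candidate with an 80% chance of passing on any one
attempt remains a false negative after three attempts with a probability of 0.2 cubed, or
0.008. Repetition with a fixed cut-off therefore shrinks one error while enlarging the other.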

Although the very appearance of competency tests is witness to concerns that some of
those who teach are not competent (that is, they represent false positive classification errors), a
fixed cut-off score for the California Basic Educational Skills Test (CBEST) suggests that the
original decision may have been modified. The criterion is the same for repeaters as it is for
first-time test-takers, in spite of the fact that repeaters have seen some version of the
instrument and have greater familiarity with the testing procedure than those who are taking the
test for the first time.2

Two other elements of CBEST procedure underscore this position. First, only the failed
portion need be repeated, suggesting that although a false negative can be corrected by repeating
the test, there is no parallel opportunity to correct for a false positive. Second, in spite of the fact
that the three sub-tests are intended to provide independent measures of discrete skills, to some
degree, higher-than-minimum scores in one area are allowed to compensate for scores below
criterion in another.
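
The difference between requiring the criterion on every sub-test and allowing compensation
across sub-tests can be sketched as follows. The two rules, the function names, and the
numeric values (a cut-off of 41 per sub-test and a floor of 37) are illustrative assumptions,
not a statement of the rule actually applied to CBEST scores.

```python
def passes_conjunctive(scores, section_cutoff):
    """Pass only if every sub-test score meets the section cut-off."""
    return all(score >= section_cutoff for score in scores)


def passes_compensatory(scores, section_cutoff, section_floor):
    """Pass if the total meets the summed cut-offs and no score is below a floor."""
    total_required = section_cutoff * len(scores)
    return sum(scores) >= total_required and min(scores) >= section_floor


# Illustrative values only: one sub-test falls below the cut-off of 41,
# but a higher score on another sub-test makes up the difference.
scores = (37, 45, 41)
print(passes_conjunctive(scores, 41))       # False
print(passes_compensatory(scores, 41, 37))  # True
```

A compensatory rule of this kind passes every candidate the conjunctive rule would pass and
some it would fail, so it can only reduce false negatives and add false positives.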

Implications

Whether intended for those who aspire to teach, to the bar, or to graduation from high
school, assessments provide the data from which classification decisions are made. Although our
focus has been on testing data and classroom teacher certification, any type of measurement data
is subject to error, which places the resulting classification decisions at risk. Decision errors are
related in that one type can only be minimized by increasing the potential for the other. The type
that is less objectionable depends upon circumstances. When there is a substantial risk to some
innocent party, for example, the tendency is to favor false negatives. However, the education
reform movement suggests that there is sometimes a discontinuity between the rationale for the
assessment and the mechanics of its implementation. Teacher competency tests were imposed
because of concerns that teachers lacked basic literacy skills. Expressed in terms of decision
errors, the concern was that some of those who become credentialed teachers represented false
positive classifications, and the point of the testing appears to have been to adjust selection
procedures to be more tolerant of false negatives. But the conditions of test administration and
data interpretation indicate a relatively greater tolerance of false positive decisions.

At this point it seems unlikely that procedures will be adjusted to compensate for what
are almost certainly significant numbers of false positive classifications, but such procedures are
certainly available. In addition to a sliding cut-off score, Millman (1989) has noted that it is
common to average the repeat scores with the original score, or to designate a "no-decision"
band in the area where most decision errors are made.
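
Two of these adjustments, averaging repeat scores and a no-decision band, can be sketched in
a few lines. The function names, the band width, and the scores are hypothetical, and the
sketch is one plausible reading of the adjustments rather than Millman's own procedure.

```python
from statistics import mean


def classify(score, cutoff, band=0.0):
    """Classify a score; scores within `band` of the cut-off get no decision."""
    if abs(score - cutoff) < band:
        return "no decision"
    return "pass" if score >= cutoff else "fail"


def classify_averaged(scores, cutoff, band=0.0):
    """Classify on the mean of all attempts rather than on the best attempt."""
    return classify(mean(scores), cutoff, band)


# A repeater who scored 44 and then 51 against a cut-off of 50.
print(classify(51, 50))                  # judged on the latest attempt alone
print(classify_averaged([44, 51], 50))   # judged on the average of attempts
print(classify(52, 50, band=3))          # judged with a no-decision band
```

Averaging makes a single inflated score less likely to carry a repeater over the cut-off,
while the band withholds judgment for exactly those scores that are most likely to be
misclassified.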

In teacher candidate testing we have an assessment situation in which procedures have been
adjusted to minimize false negatives. Although their discussion is beyond the scope of this
paper, teacher shortages and scoring patterns that correlate with the candidate’s ethnic group
appear to be the driving issues in those decisions. Competency testing may allow decision
makers to determine whether candidates have the skills and understandings that reflect literacy,
but what appears to be an institutional bias favoring false positive decision errors may undermine
the original intent.

Footnotes

1 The processes employed for determining a criterion that reflects the division between passing
and failing are an important issue that is beyond the scope of this paper. For those with an
interest in standard setting, papers by Geisinger (1991), Reid (1991), and Plake (1991), all in the
same issue of Educational Measurement: Issues and Practice, will be helpful.

2 Test-takers have seized the moment. It isn’t uncommon for those failing CBEST to take it 
several times. 

Figures 1a and 1b

The Competent and the Not Competent